ABSTRACT


The China Conference on Knowledge Graph and Semantic Computing (CCKS) 2020 Evaluation Task 3
addressed clinical named entity recognition and event extraction for Chinese electronic medical records.
Two annotated data sets and additional resources for these two subtasks were provided to the
participants. The evaluation attracted 354 teams, 46 of which submitted valid results.
Pre-trained language models were widely applied in this evaluation task. Data augmentation
and external resources also proved helpful.


1. INTRODUCTION 


The China Conference on Knowledge Graph and Semantic Computing (CCKS), founded in 2016,
is organized by the Chinese Information Processing Society of China. To promote the development of
technologies in knowledge graph and semantic computing, CCKS provided 8 evaluation tasks in 2020.
Of these tasks, Task 3 focuses on named entity recognition (NER) and event extraction (EE) in Chinese
electronic medical records (EMRs).


† Corresponding author: Jiangtao Zhang (Email: zhang-jt13@tsinghua.org.cn; ORCID: 0000-0001-8462-3915).


202211.00390v1 


chinaXiv 


Overview of CCKS 2020 Task 3: Named Entity Recognition and Event Extraction in Chinese
Electronic Medical Records


EMRs are the core assets of a hospital. They are usually semi-structured data that contain a large amount
of free text. Although a large amount of EMR data has been accumulated, most of it is not fully utilized,
because free text is difficult to process automatically.


NER and EE are commonly used techniques for acquiring useful information from free text. NER in EMRs
is also known as clinical named entity recognition (CNER). With an NER model, we can recognize the names
of diseases, drugs and other medical entities in EMRs. The most popular NER method is
sequence labeling, which can be based on long short-term memory (LSTM) [1,2,3] or bidirectional encoder
representations from transformers (BERT) [4]. Clinical event extraction helps us identify medical events in
EMRs, such as the tumor site, the tumor size and the site to which the tumor has metastasized. LSTM, BERT
and other methods are also applied in EE.


Traditional NER and EE are based on supervised models. However, annotating clinical information
is much harder than annotating general-domain information. Although there are some public medical data
sets for the NER task, such as i2b2 [5], ShARe CLEF eHealth [6] and SemEval [7], there are hardly any public
Chinese medical data sets. To promote the development of semantic analysis of Chinese EMRs, the Knowledge
Engineering Group of Tsinghua University and Yiducloud Beijing Technology Co., Ltd. organized this
evaluation challenge at CCKS 2020. The data sets of this task, provided by Yiducloud, are restricted to CCKS
evaluation only.


2. RELATED WORK 


CCKS 2020 Task 3 focuses on NER and EE in Chinese EMRs. NER and EE are core problems
in natural language processing.


2.1 Chinese NER 


NER is the task of locating and classifying certain occurrences of words or expressions in unstructured
text. In English NER, LSTM-CRF (Conditional Random Field) models [1,2,3] are a classic method that leverages
both character-level and word-level representations and can achieve state-of-the-art results. Compared with
NER in English, Chinese NER is more difficult since sentences in Chinese are not naturally segmented. A
common practice for Chinese NER is to first perform word segmentation using an existing Chinese word
segmentation (CWS) system and then apply a word-based NER model to infer the NER tags. However, this
pipeline method suffers from error propagation, since CWS errors inevitably affect the performance
of NER. Therefore, some approaches directly use a character-based NER model [8,9]. A drawback of the
purely character-based NER model is that word information is not fully exploited. To incorporate word
information in Chinese NER, some recent methods, such as [10,11,12,13,14], resort to an automatically
constructed lexicon.
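
To make the character-based sequence-labeling formulation concrete, the following minimal sketch converts entity spans into per-character BIO tags. The function name `spans_to_bio`, the exclusive end offset, and the example sentence are illustrative assumptions, not part of the task data or of any cited system.

```python
from typing import List, Tuple


def spans_to_bio(text: str, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, category) spans into one BIO tag per character.

    Assumptions for this sketch: `end` is exclusive and spans do not overlap.
    """
    tags = ["O"] * len(text)
    for start, end, category in spans:
        tags[start] = f"B-{category}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{category}"          # remaining characters
    return tags


# Made-up example sentence; "胃癌" (characters 2..3) is tagged as a disease.
text = "患者胃癌术后"
tags = spans_to_bio(text, [(2, 4, "Dis")])
for ch, tag in zip(text, tags):
    print(ch, tag)
```

Because each Chinese character receives its own tag, no word segmentation step is needed, which is why this formulation avoids the error propagation of the pipeline approach.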


2.2 Event Extraction


Events are a common and important type of knowledge. Therefore, identifying events in texts and
extracting their arguments are important for many applications. DMCNN [15] is a classic EE model, which
uses a convolutional neural network (CNN) to learn semantic features from raw texts, including
lexical-level and sentence-level features. JRNN [16] is a recurrent neural network (RNN) based method
for EE, which aims to integrate discrete features with automatically learned features. JMEE [17] is a
method based on graph convolution networks (GCNs), which jointly extracts multiple event triggers and
arguments by introducing syntactic shortcut arcs to enhance information flow and using attention-based
GCNs to model graph information. Recently, event extraction has been explicitly cast as a machine reading
comprehension (MRC) problem [18], and MRC models are used to solve it.


2.3 NER and EE in Clinical Text 


Information extraction from clinical text has become increasingly important in recent years. TREC
hosted the first shared tasks in clinical natural language processing (NLP), which focused on identifying
relevant and irrelevant documents. Other evaluation tasks include ImageCLEFmed [19] and i2b2 [5]. For
clinical NER, LSTM units and a conditional random field classifier [20] are used in the NER component.
An unsupervised method [21] is used to build clinical NER systems that do not require any manual
annotations; the models are trained on an automatically annotated corpus followed by self-training
iterations. For EE in clinical text, a bi-directional long short-term memory network assisted by an attention
mechanism [22] is utilized to uncover the important aspects of the patient’s medical conditions.


3. TASK DESCRIPTION 
3.1 Clinical Named Entity Recognition 


Given the free text from EMRs, this task aims to identify clinical entity mentions and classify them
into pre-defined categories. As background, the method of [21] trains clinical NER systems without
any manual annotations. It requires only a raw text corpus and a resource such as the Unified Medical
Language System (UMLS) that provides a list of named entities along with their semantic types. Using these
resources, annotations are obtained automatically to train machine learning models. That method was
evaluated on the NER shared-task data sets of i2b2 2010 and SemEval 2014.


3.1.1 Formalized Definition 


We define this task formally. 


INPUT: 
1). A document collection from EMRs: $D = \{d_1, \ldots, d_i, \ldots, d_N\}$, where $d_i = (w_{i1}, \ldots, w_{in})$
2). A set of pre-defined categories: $C = \{c_1, \ldots, c_m\}$

OUTPUT:

Collections of entity mention-category pairs: $\{(m_1, c_{m_1}), \ldots, (m_i, c_{m_i}), \ldots, (m_p, c_{m_p})\}$.

Here $m_i = (d_j, b_i, e_i)$ represents an entity mention in document $d_j$, where $b_i$ and $e_i$ are the start
and end positions of $m_i$, respectively, and $c_{m_i} \in C$ represents the category of $m_i$. Overlap between
mentions is not allowed, that is, $e_i < b_{i+1}$.
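
The output definition can be illustrated with a small data structure. The sketch below is one possible representation, assuming integer document identifiers and inclusive end positions; the names `Mention` and `is_valid` are hypothetical. It checks that each category is pre-defined and that consecutive mentions in the same document satisfy the non-overlap constraint $e_i < b_{i+1}$.

```python
from dataclasses import dataclass
from typing import List

# The six pre-defined categories of the CNER subtask.
CATEGORIES = {"Dis", "ImgExam", "LabExam", "Operation", "Drug", "Anatomy"}


@dataclass(frozen=True)
class Mention:
    doc_id: int    # index j of the document d_j
    start: int     # b_i
    end: int       # e_i (assumed inclusive in this sketch)
    category: str  # c_{m_i}


def is_valid(mentions: List[Mention]) -> bool:
    """Check categories and the non-overlap constraint within each document."""
    ordered = sorted(mentions, key=lambda m: (m.doc_id, m.start))
    for m in ordered:
        if m.category not in CATEGORIES or m.start > m.end:
            return False
    for prev, nxt in zip(ordered, ordered[1:]):
        # Mentions in the same document must satisfy e_i < b_{i+1}.
        if prev.doc_id == nxt.doc_id and not prev.end < nxt.start:
            return False
    return True


print(is_valid([Mention(0, 0, 1, "Dis"), Mention(0, 2, 3, "Drug")]))  # disjoint spans
print(is_valid([Mention(0, 0, 2, "Dis"), Mention(0, 2, 3, "Drug")]))  # spans share position 2
```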


3.1.2 Pre-defined Categories 


There are 6 categories that are defined as follows. 


1). Disease and diagnosis (Dis)
2). Imaging examination (ImgExam)
3). Laboratory examination (LabExam)
4). Operation
5). Drug
6). Anatomy


3.2 Clinical Event Extraction 

3.2.1 Formalized Definition 
This task is formally defined as follows. 

INPUT:
1). Event entity.
2). A document collection from EMRs: $D = \{d_1, \ldots, d_N\}$, where $d_i = (w_{i1}, \ldots, w_{in})$
3). A set of pre-defined attributes: $P = \{p_1, p_2, \ldots, p_m\}$

OUTPUT:

Collections of attribute entities: $\{[d_i, (p_j, (s_1, s_2, \ldots, s_k))]\}$, where $1 \le i \le N$ and $1 \le j \le m$.

Here $s_k$ is an entity of attribute $p_j$ from document $d_i$. Each attribute may have zero, one or several
entities.


3.2.2 Pre-defined Attributes 


The 3 pre-defined attributes are: 


1). Tumor Primary Site 
2). Tumor Size 
3). Tumor Metastatic Site 


4. DATA SETS


The data sets were provided by Yiducloud Beijing Technology Co., Ltd. Yiducloud organized a professional
medical team to annotate these data. The data sets are for CCKS evaluation only①.


Compared with the CNER task in CCKS 2019, the annotated data set is about 4 times larger. In addition,
Yiducloud provided an entity vocabulary and a large amount of unannotated data as additional resources that
participants could use during the evaluation. The statistics of the CNER data set are shown in Table 1.


The clinical event extraction data set includes a labeled training set, an unlabeled set and a vocabulary,
which brings this challenge closer to real-world scenarios. The statistics of the clinical event extraction
data set are shown in Table 2.


Table 1. The statistics of clinical named entity recognition data set. 


Split      Docs  Dis   ImgExam  LabExam  Operation  Drug  Anatomy  Total
Train      1500  6211  1490     1885     1327       2841  12660    26414
Test       300   1361  270      251      221        942   2661     5706
Unlabeled  1000  -     -        -        -          -     -        -


Table 2. The statistics of clinical event extraction data set. 


Split      Docs  TumorPrimarySite  TumorSize  TumorMetastaticSite  Total
Train      1500  6211              1490       1885                 1327
Test       300   1361              270        251                  221
Unlabeled  1300  -                 -          -                    -


5. EVALUATION METRICS 
5.1 Clinical Named Entity Recognition 
5.1.1 Strict Metric 
There are two evaluation metrics, the strict metric and the relaxed metric. The set of extracted entities is
denoted as $S$ and the set of gold entities is denoted as $G$.


For the strict metric, $s_i \in S$ is equal to $g_j \in G$ only if they are exactly the same:


1). The start position of $s_i$ equals that of $g_j$
2). The end position of $s_i$ equals that of $g_j$
3). The category of $s_i$ equals that of $g_j$


① To access the data sets, please contact the corresponding author after signing the Data Usage Agreement.


380 Data Intelligence 




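The strict Precision, Recall and F1 are calculated as follows, where $|S \cap_s G|$ is the number of
extracted entities that match a gold entity under the strict criteria:


$$P_s = \frac{|S \cap_s G|}{|S|} \tag{1}$$
$$R_s = \frac{|S \cap_s G|}{|G|} \tag{2}$$
$$F1_s = \frac{2 P_s R_s}{P_s + R_s} \tag{3}$$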
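
As an illustration of the strict metric, the following sketch computes precision, recall and F1 by exact matching of start position, end position and category. The tuple layout `(doc_id, start, end, category)`, the function name `strict_prf` and the toy entities are assumptions made for this example; this is not the official scoring script.

```python
from typing import Set, Tuple

# Hypothetical entity layout: (doc_id, start, end, category).
Entity = Tuple[int, int, int, str]


def strict_prf(pred: Set[Entity], gold: Set[Entity]) -> Tuple[float, float, float]:
    """Strict precision, recall and F1: an entity counts only on an exact match."""
    matched = len(pred & gold)  # same document, same boundaries, same category
    p = matched / len(pred) if pred else 0.0
    r = matched / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = {(0, 0, 1, "Dis"), (0, 5, 7, "Drug")}
pred = {(0, 0, 1, "Dis"), (0, 5, 6, "Drug")}  # second entity has a wrong end position
print(strict_prf(pred, gold))
```

In this toy case only one of the two predictions matches exactly, so precision, recall and F1 are all 0.5.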


5.1.2 Relaxed Metrics 


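The relaxed metric does not require that $s_i \in S$ and $g_j \in G$ are exactly the same. They only need to
meet the following requirements:


1). The maximum of the start positions of $s_i$ and $g_j$ is less than or equal to the minimum of the
end positions of $s_i$ and $g_j$, that is, the two spans overlap
2). The category of $s_i$ equals that of $g_j$


The relaxed Precision, Recall and F1 are calculated as follows, where $|S \cap_r G|$ is the number of
entities matched under the relaxed criteria:


$$P_r = \frac{|S \cap_r G|}{|S|} \tag{4}$$
$$R_r = \frac{|S \cap_r G|}{|G|} \tag{5}$$
$$F1_r = \frac{2 P_r R_r}{P_r + R_r} \tag{6}$$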
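
The relaxed matching rule can be sketched in the same way. The text does not specify how multiple overlaps are counted, so this example makes an explicit assumption: precision counts predicted entities with at least one relaxed match, and recall counts gold entities with at least one relaxed match. The tuple layout and function names are hypothetical.

```python
from typing import List, Tuple

# Hypothetical entity layout: (doc_id, start, end, category).
Entity = Tuple[int, int, int, str]


def relaxed_match(s: Entity, g: Entity) -> bool:
    """Same document and category, and max(starts) <= min(ends), i.e. the spans overlap."""
    return s[0] == g[0] and s[3] == g[3] and max(s[1], g[1]) <= min(s[2], g[2])


def relaxed_prf(pred: List[Entity], gold: List[Entity]) -> Tuple[float, float, float]:
    # Assumption: an entity counts as a hit if it overlaps at least one counterpart.
    hit_pred = sum(any(relaxed_match(s, g) for g in gold) for s in pred)
    hit_gold = sum(any(relaxed_match(s, g) for s in pred) for g in gold)
    p = hit_pred / len(pred) if pred else 0.0
    r = hit_gold / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = [(0, 0, 1, "Dis"), (0, 5, 7, "Drug")]
pred = [(0, 0, 1, "Dis"), (0, 5, 6, "Drug"), (0, 9, 10, "Drug")]
p, r, f1 = relaxed_prf(pred, gold)
print(round(p, 4), round(r, 4), round(f1, 4))
```

Here the prediction with the wrong end position still counts under the relaxed rule, while the spurious third prediction lowers precision only.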


5.2 Clinical Event Extraction 


There may be more than one attribute entity for an event attribute. The Precision, Recall and F1 are
calculated over attribute entities rather than over attributes.
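
The following sketch shows what scoring at the attribute-entity level could look like. It assumes exact string matching and micro-averaging over all documents, neither of which is stated in the text; the nested-dictionary layout and the names `flatten` and `event_prf` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical layout: doc_id -> attribute name -> list of entity strings.
Annotations = Dict[int, Dict[str, List[str]]]


def flatten(ann: Annotations) -> Set[Tuple[int, str, str]]:
    """Turn nested annotations into a set of (doc_id, attribute, entity) items."""
    return {
        (doc_id, attr, ent)
        for doc_id, attrs in ann.items()
        for attr, ents in attrs.items()
        for ent in ents
    }


def event_prf(pred: Annotations, gold: Annotations) -> Tuple[float, float, float]:
    """Precision, recall and F1 counted per attribute entity (exact match assumed)."""
    s, g = flatten(pred), flatten(gold)
    matched = len(s & g)
    p = matched / len(s) if s else 0.0
    r = matched / len(g) if g else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Made-up example: one document, three gold entities, three predicted entities.
gold = {0: {"Tumor Primary Site": ["胃"], "Tumor Size": [], "Tumor Metastatic Site": ["肝", "肺"]}}
pred = {0: {"Tumor Primary Site": ["胃"], "Tumor Size": ["3cm"], "Tumor Metastatic Site": ["肝"]}}
p, r, f1 = event_prf(pred, gold)
print(round(p, 4), round(r, 4), round(f1, 4))
```

Counting per entity means that an attribute with two gold entities contributes two items to recall, so partially correct attributes receive partial credit.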


6. RESULTS AND DISCUSSION 


This evaluation attracted 354 teams, and 46 of them successfully submitted their results. On the clinical
named entity recognition task, 32 teams submitted results and 5 evaluation papers were received.
On the clinical event extraction task, 14 teams submitted results and 3 papers were received. We list the
top teams in Table 3 and Table 4, respectively.


Table 3. The results of clinical named entity recognition.


Rank  Team             Affiliation                               Score
1     CASIA_Unisound   CASIA & Unisound AI Technology Co., Ltd.  0.91564
2     TMAIL            Medical AI Lab, Tencent Holdings Ltd.     0.91541
3     ywm              Lantone                                   0.91242
4     ChongChongChong  HFUT & SCUT                               0.90801
5     SZU_IC           Shenzhen University                       0.90511
6     mAI@pumc         Peking Union Medical College              0.90461

Table 4. The results of clinical event extraction. 

Rank  Team        Affiliation                                Score
1     dst         Knowledge Graph Group, Baidu, Inc.         0.76234
2     TMAIL       Medical AI Lab, Tencent Holdings Ltd.      0.74579
3     LHJB        National University of Defense Technology  0.73521
4     araloak     National University of Defense Technology  0.72730
5     zhjohnchan  Individual                                 0.71247
6     cecbrain    CEC Cloud Brain                            0.67958


6.1 Clinical Named Entity Recognition 


For the clinical named entity recognition task, the top two teams achieved very close scores.
Both of them focused on the label inconsistency problem in CNER.


The top-ranked team comes from the Institute of Automation, Chinese Academy of Sciences (CASIA) and Unisound
AI Technology Co., Ltd. They proposed a hybrid system composed of a semi-supervised noisy label learning
model based on adversarial training and a rule-based post-processing module. They adopted a five-fold
cross-voting mechanism to handle the annotation inconsistency problem in the data set. They used model
ensembling and semi-supervised training to alleviate the problem of insufficient training data. They also
applied adversarial training to decrease aleatoric uncertainty and epistemic uncertainty simultaneously.
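
The details of the team's voting mechanism are not given here. As a generic illustration of how predictions from several models (for example, models trained on different folds) can be combined, the sketch below keeps an entity only if enough models predict it. The function name `vote_entities`, the `min_votes` threshold and the tuple layout are assumptions for this example and do not describe the team's actual system.

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple

# Hypothetical entity layout: (doc_id, start, end, category).
Entity = Tuple[int, int, int, str]


def vote_entities(predictions: List[Iterable[Entity]], min_votes: int) -> Set[Entity]:
    """Keep entities predicted by at least `min_votes` of the models."""
    # Each model votes at most once per entity.
    counts = Counter(e for pred in predictions for e in set(pred))
    return {e for e, c in counts.items() if c >= min_votes}


a, b, c = (0, 0, 1, "Dis"), (0, 5, 7, "Drug"), (0, 9, 10, "Anatomy")
model_outputs = [[a, b, c], [a, b], [a]]
kept = vote_entities(model_outputs, min_votes=2)
print(sorted(kept))
```

Entity-level voting of this kind can still keep two overlapping entities if each collects enough votes, so a real system would need an additional step to resolve overlaps.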


Based on the submitted papers, we have come to the following conclusions. 


1). Pre-trained language models (PLMs) such as BERT or ELMo [23] are widely applied. Using PLMs
has become common practice among participants. Most teams did not simply apply
the general BERT model, but a model pre-trained on Chinese documents, such as RoBERTa-wwm [24].
Furthermore, some of them collected Chinese medical documents and pre-trained PLMs
on these in-domain documents. The usage of PLMs in this year’s evaluation challenge is more diverse
than in the previous competitions held at CCKS.

2). Model ensemble. Most teams applied this technique in their submissions. Ensembled models
usually achieve better results than a single model. The two-stage and k-fold ensemble methods are
effective.

3). Feature engineering and rules are still valuable. In the clinical domain, there are many regular
patterns and little annotated data. Therefore, participants can benefit from feature engineering. Some
of the teams added Chinese word features to their models and gained stable improvements.
They also introduced some rules to alleviate data noise.

4). Semi-supervised methods. This evaluation provides 1,000 unlabeled documents as additional resources.
Some participants generated pseudo labels for the unlabeled data with a supervised model and
trained the final model with both the labeled and the pseudo-labeled data.

5). Adversarial training. There is some unavoidable label noise in the training data. To train a robust
model that is not sensitive to this noise, some teams added perturbations to the word embeddings during
training.

6). Domain vocabulary. Vocabulary is usually an important resource for the CNER task. In past CNER
evaluations, participants usually collected and extended clinical vocabularies in various ways.
The most popular vocabularies include ICD-10, the DrugBank database and some health websites
such as “haodf.com” and “xywy.com”. However, the top 3 teams this year did not apply any
vocabularies in their models. The main reason is their extensive use of PLMs. Replacing vocabularies
with PLMs may become a trend.
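
The text does not say which adversarial training method the teams used. One common way to perturb embeddings is a gradient-normalized step (in the style of the fast gradient method); the sketch below shows only that perturbation, in plain Python, as an assumed illustration. The function name `fgm_perturbation` and the default `epsilon` are hypothetical.

```python
import math
from typing import List


def fgm_perturbation(grad: List[float], epsilon: float = 1.0) -> List[float]:
    """Return r = epsilon * g / ||g||_2 for the loss gradient g of an embedding.

    Assumption: this normalized-gradient form is one common choice; the teams'
    actual perturbation scheme is not described in the text.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return [0.0] * len(grad)  # no gradient signal, so no perturbation
    return [epsilon * g / norm for g in grad]


embedding = [0.2, -0.1]
gradient = [3.0, 4.0]  # made-up gradient of the loss w.r.t. the embedding
delta = fgm_perturbation(gradient, epsilon=1.0)
perturbed = [e + d for e, d in zip(embedding, delta)]
print(delta, perturbed)
```

In training, the loss would be computed again on the perturbed embedding and both losses would contribute to the parameter update, which encourages predictions that are stable under small input changes.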


6.2 Clinical Event Extraction 


For the clinical event extraction task, the top-ranked team achieved an F1 score of 0.76234 and the
second-ranked team achieved an F1 score of 0.74597. The competition was fierce.


The top-ranked team comes from the Knowledge Graph Group, Baidu, Inc. They proposed a system mainly based
on a pre-trained language model. They applied domain adaptation and task adaptation during pre-training
in order to improve the modeling ability of the pre-trained language model. To handle the challenge of
insufficient training data, they applied back translation to expand the training data. They also used the
entity vocabulary as model input.


Based on the submitted papers, we have come to the following conclusions. 


1). Pre-trained language models are widely used. As in the CNER task, the top two teams applied PLMs in
their models, and both chose RoBERTa [25] as the backbone. The usefulness of PLMs has been demonstrated.

2). Data augmentation. The annotation of clinical documents is very difficult. Therefore, there are not
sufficient labeled data for training. The top teams tried various data augmentation methods to
enhance the robustness of their models. Some of them doubled the training set by randomly re-arranging
the sentence order in each training instance. Another team applied the back translation strategy to
double the training set: they translated each training instance into English and then translated it
back into Chinese. Other teams tried randomly replacing key information across the
whole field.

3). Feature engineering and rules. Although deep learning models are the key parts of the top teams’
systems, all of them utilized feature engineering or rules in pre-processing and post-processing. The
pre-processing mainly focuses on obtaining cleaner data. Some teams also cut the documents to a
certain length to meet the requirements of their models. The post-processing rules are applied to the
model outputs to filter out meaningless results.
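
The sentence re-arrangement strategy mentioned above can be sketched as follows. The sentence delimiters, the function names and the example text are assumptions for illustration; the teams' actual splitting rules are not described.

```python
import random
import re
from typing import List, Optional


def split_sentences(text: str) -> List[str]:
    """Split after Chinese sentence-ending punctuation, keeping the punctuation."""
    return [s for s in re.split(r"(?<=[。！？；])", text) if s]


def shuffle_sentences(text: str, seed: Optional[int] = None) -> str:
    """Return a copy of `text` with its sentences in random order."""
    sentences = split_sentences(text)
    rng = random.Random(seed)  # a seed makes the augmentation reproducible
    rng.shuffle(sentences)
    return "".join(sentences)


# Made-up training instance with three sentences.
original = "胃癌术后。肝内见转移灶。肿物大小约3cm。"
augmented = shuffle_sentences(original, seed=0)
print(augmented)
```

Since the event extraction output consists of attribute entities rather than character offsets, re-ordering sentences leaves the labels of an instance unchanged. One limitation of this sketch is that a final sentence without ending punctuation would be merged with its neighbor after shuffling.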


7. CONCLUSION 


This paper presents a detailed introduction to CCKS 2020 Task 3 on clinical named entity recognition
and clinical event extraction for Chinese EMRs. The evaluation results show that the participants achieved
strong performance, especially in the CNER task. The models are more varied in this year’s evaluation
than in the previous year’s evaluation, and participants modeled the evaluation problems from different
perspectives. Through this evaluation, we hope that more researchers will focus on semantic analysis of
Chinese EMRs and that more companies will pay attention to the industrialization of Chinese EMRs.